8 research outputs found
On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism
Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012
Towards the detection of cross-language source code reuse
The Internet has made huge amounts of information available, including source code. Source code repositories and, in general, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a barely investigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (i.e., comments, code, reserved words, etc.) leads to different results. When considering three programming languages (C++, Java, and Python), the best result is obtained when comments are discarded and the entire source code is considered.
This work has been developed with the support of the project TEXT-ENTERPRISE 2.0: Text comprehension techniques applied to the needs of the Enterprise 2.0 (MICINN, Spain TIN2009-13391-C04-03 (Plan I+D+i)).
Flores Sáez, E.; Barrón Cedeño, LA.; Rosso, P.; Moreno Boronat, LA. (2011). Towards the detection of cross-language source code reuse. In Natural Language Processing and Information Systems. Springer Verlag (Germany). 6716:250-253. https://doi.org/10.1007/978-3-642-22327-3_31
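As a rough illustration of the character n-gram comparison mentioned in this abstract, the sketch below scores two code fragments by the cosine similarity of their character trigram counts. The normalization (lowercasing, whitespace removal), the choice of n = 3, the use of raw counts, and all function names and snippets are assumptions made for this example; the abstract does not specify the weighting or similarity measure actually used.

```python
from collections import Counter
import math
import re


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and removing whitespace."""
    normalized = re.sub(r"\s+", "", text.lower())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as sharing nothing
    return dot / (norm_a * norm_b)


def ngram_similarity(code_a: str, code_b: str, n: int = 3) -> float:
    """Similarity of two code fragments based on their character n-gram profiles."""
    return cosine_similarity(char_ngrams(code_a, n), char_ngrams(code_b, n))


# Hypothetical snippets, invented for illustration only.
java_snippet = "int total = 0; for (int i = 0; i < n; i++) { total += values[i]; }"
cpp_snippet = "int total = 0; for (int i = 0; i < n; ++i) { total += values[i]; }"
python_snippet = "print('hello world')"

print(ngram_similarity(java_snippet, cpp_snippet))
print(ngram_similarity(java_snippet, python_snippet))
```

Restricting the comparison to particular sections of the code (comments only, code without comments, reserved words only) would amount to filtering each fragment before calling `char_ngrams`.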
Overview of the 3rd International Competition on Plagiarism Detection
[EN] This paper overviews eleven plagiarism detectors that have been developed
and evaluated within PAN’11. We survey the detection approaches developed
for the two sub-tasks “external plagiarism detection” and “intrinsic plagiarism
detection,” and we report on their detailed evaluation based on the third
revised edition of the PAN plagiarism corpus PAN-PC-11.
This work was partly funded by the European Commission as part of the WIQEI IRSES project (grant no. 269180) within the FP7 Marie Curie People Framework, by MICINN as part of the TextEnterprise 2.0 project (TIN2009-13391-C04-03) within the Plan I+D+i, and as part of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
Potthast, M.; Eiselt, A.; Barrón Cedeño, LA.; Stein, B.; Rosso, P. (2011). Overview of the 3rd International Competition on Plagiarism Detection. CEUR Workshop Proceedings. 1177. http://hdl.handle.net/10251/46639
Cross-language source code re-use detection using latent semantic analysis
[EN] Nowadays, the Internet is the main source of information: blogs, encyclopedias, discussion forums, source code repositories, and many other resources are just one click away. The temptation to re-use these materials is very high. Even source code is easily available through a simple search on the Web. There is a need to detect potential instances of source code re-use. Source code re-use detection has usually been approached by comparing source codes in their compiled version. When dealing with cross-language source code re-use, traditional approaches can deal only with the programming languages supported by the compiler. We assume that a source code is a piece of text, with its syntax and structure, so we aim at applying models for free text re-use detection to source code. In this paper we compare a Latent Semantic Analysis (LSA) approach with previously used text re-use detection models for measuring cross-language similarity in source code. The LSA-based approach shows slightly better results than the other models and distinguishes re-used from merely related source codes with high performance.
This work was partially supported by Universitat Politècnica de València, WIQ-EI (IRSES grant n. 269180), and DIANA-APPLICATIONS (TIN2012-38603-C02-01) project. The work of the fourth author is also supported by VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
Flores Sáez, E.; Barrón-Cedeño, LA.; Moreno Boronat, LA.; Rosso, P. (2015). Cross-language source code re-use detection using latent semantic analysis. Journal of Universal Computer Science. 21(13):1708-1725. https://doi.org/10.3217/jucs-021-13-1708
Extracting Parallel Corpora from Wikipedia on the basis of Phrase Level Bilingual Alignment
[EN] This paper presents a new technique for extracting parallel corpora from Wikipedia on the basis of statistical machine translation techniques. Specifically, we have used the word-level alignment models from IBM to obtain phrase-level bilingual alignments between document pairs. To evaluate the model, we have manually annotated a test set of comparable English-Spanish document pairs. The obtained results are encouraging.
This work was carried out within the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems, partially funded by the EC (FEDER/FSE; WIQEI IRSES no. 269180 / FP 7 Marie Curie People), by MICINN as part of the Text-Enterprise 2.0 project (TIN2009-13391-C04-03) within the Plan I+D+i, and by CONACyT grant 192021. It was also supported by the EC (FEDER/FSE) and by MEC/MICINN under the MIPRCV “Consolider Ingenio 2010” programme (CSD2007-00018) and the iTrans2 project (TIN2009-14511), by MITyC within the erudito.com project (TSI-020110-2009-439), by the Generalitat Valenciana through grants Prometeo/2009/014 and GV/2010/067, and by the “Vicerrectorado de Investigación de la UPV” through grant 20091027.
Silvestre Cerdà, JA.; Garcia Martinez, MM.; Barrón Cedeño, LA.; Civera Saiz, J.; Rosso, P. (2011). Extracción de Corpus Paralelos de la Wikipedia basada en la Obtención de Alineamientos Bilingües a Nivel de Frase. CEUR Workshop Proceedings. 824:14-21. http://hdl.handle.net/10251/27930
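To make the idea of IBM word-level alignment models more concrete, here is a minimal sketch of EM training for IBM Model 1, which estimates word translation probabilities from sentence pairs. The abstract does not say which IBM model was used or how phrase-level alignments were derived from it, so the choice of Model 1, the function name `train_ibm_model1`, and the toy English-Spanish corpus are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM on tokenized sentence pairs."""
    target_vocab = {word for _, target in pairs for word in target}
    uniform = 1.0 / len(target_vocab)
    # Keys are (target_word, source_word); start from a uniform distribution.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for source, target in pairs:
            source_with_null = ["<NULL>"] + source  # allow unaligned target words
            for tw in target:
                # E-step: distribute each target word over the source words.
                norm = sum(t[(tw, sw)] for sw in source_with_null)
                for sw in source_with_null:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(tw, sw): c / total[sw] for (tw, sw), c in count.items()})
    return t


# Toy parallel corpus (English source, Spanish target), invented for this example.
toy_pairs = [
    ("the house".split(), "la casa".split()),
    ("the book".split(), "el libro".split()),
    ("a book".split(), "un libro".split()),
]

model = train_ibm_model1(toy_pairs)
for target_word in ("libro", "el", "un"):
    print(target_word, round(model[(target_word, "book")], 3))
```

In a pipeline of the kind the abstract describes, probabilities like these would be one ingredient for scoring how well candidate sentences from two comparable documents align; that scoring step is not shown here.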
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020
Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it)
Detección automática de plagio en texto [Automatic plagiarism detection in text]
Text plagiarism means including in a document text written by another person without giving them credit. We have tested some existing methods and developed two new ones for plagiarism detection: one for reducing the search space and another for detecting plagiarism from one language to another. These subtasks have hardly been addressed before.
Barrón Cedeño, LA. (2008). Detección automática de plagio en texto. http://hdl.handle.net/10251/12186
Overview of the 4th International Competition on Plagiarism Detection
[EN] This paper overviews 15 plagiarism detectors that have been evaluated
within the fourth international competition on plagiarism detection at PAN 12.
We report on their performances for two sub-tasks of external plagiarism detection:
candidate document retrieval and detailed document comparison. Furthermore,
we introduce the PAN plagiarism corpus 2012, the TIRA experimentation
platform, and the ChatNoir search engine for the ClueWeb. They add scale and
realism to the evaluation as well as new means of measuring performance.
This work was partly funded by the EC WIQ-EI project (project no. 269180) within the FP7 People Program, by the MICINN Text-Enterprise (TIN2009-13391-C04-03) research project, and by the ERCIM "Alain Bensoussan" Fellowship Programme (funded from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement number 246016).
Potthast, M.; Gollub, T.; Hagen, M.; Graßegger, J.; Kiesel, J.; Michel, ML.; Oberländer, A.... (2012). Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers, 17-20 September. 101-128. http://hdl.handle.net/10251/55282